ABSTRACT

In this study, we address the problem of in-set/out-of-set speaker
recognition with sparse enrollment data. Sparse enrollment data
presents a unique challenge due to a lack of acoustic space coverage.
The proposed algorithm focuses on filling acoustic holes and
fortifying the phones expected in the test stage. This scheme is made
possible by using a GMM to classify the speaker phone information at
the feature level. Parallel training on the most frequently occurring
(top) and less frequently occurring (bottom) rank-ordered mixture
classes (speaker phone classes) is called “Sweet-16”, and employing a
test data mixture histogram with Sweet-16 is called “Sweet-16
On-The-Fly (OTF)”. The Sweet-16 OTF method is evaluated using
telephone conversation speech from the FISHER corpus. Sweet-16 OTF
improves EER by an average of 2.17% absolute over the previous
Sweet-16, and by an average of 4.03% absolute over the GMM-UBM
baseline, using 2 sec test data. The proposed algorithm is a
noteworthy step toward compensating for both sparse enrollment data
and limited test data.

Index Terms: in-set/out-of-set speaker recognition, cohort speakers,
data sparseness, speaker adaptation, speaker similarity

1.  INTRODUCTION 

In-set/out-of-set speaker recognition provides a binary decision for
a claimed speaker based on predefined speaker models from an in-set
group. Applications include identifying speakers in a multi-speaker
conversation or broadcast news, and granting security access to a
specific group in an organization. A speaker’s intrinsic and
extrinsic traits have previously been studied to achieve robust
speaker recognition using clustering [1], discriminative training
[2], or high-level information [3]. The Gaussian Mixture Model (GMM)
provides a robust text-independent speaker recognition system [4][2].
The statistical model represents the most common characteristics of
the available speaker data. A speaker-independent model, known as the
Universal Background Model (UBM), is constructed from a development
speaker group to represent out-of-set speakers. As the out-of-set
speaker group becomes larger, the UBM plays a crucial role in
deciding the speaker’s identity, such as in the NIST Speaker
Recognition Evaluation (SRE) task.

This project was funded by AFRL under a subcontract to RADC Inc.
under FA8750-05-C-0029, and by Univ. of Texas at Dallas under project
EMMITT.

In this study, we focus on sparse enrollment data (5 sec) with short
test utterances (2~6 sec) for the in-set/out-of-set problem. Sparse
enrollment data poses a unique challenge due to a lack of acoustic
phone coverage compared with longer conversational speech data, and
this limited phone coverage becomes especially risky when evaluating
with short test utterances. We call this phenomenon the “acoustic
hole in the acoustic model space”. We focus here on filling the
acoustic holes and fortifying the phonemes in the sparse enrollment
data to reduce the equal error rate (EER) of the system. Phone
classification is achieved using a Speaker Independent GMM
(S.I. GMM), and the classified speaker phone information helps the
speaker model fill acoustic holes and reinforce the phones not seen
in the enrollment stage. The proposed system attempts to achieve a
major impact by employing the phone information distribution of the
test data. If the test data is shorter than the enrollment data, the
proposed algorithm focuses on fortifying the phones expected in the
test stage. The resulting speaker model focuses only on the phone
information of the 2 sec test data, so the model will generally
discriminate better for 2 sec data. In other cases, the longer test
data provides more phoneme coverage information than the enrollment
data. Here, separate training for the top and bottom rank-ordered
mixture index classes is called “Sweet-16”, and employing the mixture
index histogram of the test frame data labeled using Sweet-16 is
called “Sweet-16 On-The-Fly (OTF)”. This approach identifies the
acoustic holes with more information, increasing the probability of
filling them using a parallel training strategy.

This paper is organized as follows. Sec. 2 explains the baseline
systems used to evaluate the proposed algorithm. Sec. 3 presents the
motivation and a detailed procedure for developing the proposed
algorithm. Next, the proposed algorithm is evaluated against the
baseline systems in Sec. 4.2. Finally, conclusions and future work
are discussed in Sec. 5.

2. BASELINE SYSTEM

2.1. In-set/Out-of-set Speaker Recognition

We assume we are given a set of in-set (enrolled) speakers, and a
collected data set $X_n$ corresponding to each enrolled speaker
$S_n$, $1 \le n \le N_{\text{in-set}}$. Let the data $X_0$ represent
all outside non-enrolled speakers in the development set. Each
speaker dependent statistical model $\Lambda_n$,
$\{\Lambda_n \in \Lambda,\ 1 \le n \le N_{\text{in-set}}\}$, can be
obtained from $X_n$. In the first stage, called (closed-set) speaker
identification, we first classify the observation $X$ as the most
likely in-set speaker $\Lambda^*$:

$$\Lambda^* = \arg\max_{1 \le n \le N_{\text{in-set}}} p(X \mid \Lambda_n) \qquad (1)$$

In the second stage, called speaker verification, we verify whether
the observation $X$ truly belongs to $\Lambda^*$ or not (i.e.,
accept/reject).
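
The two stages can be sketched as follows. This is a minimal illustration, not the authors' implementation: the per-speaker scores are assumed to be precomputed log-likelihoods, and the verification stage is assumed to be a log-likelihood ratio test against the UBM with a hypothetical `threshold`, which the text does not specify.

```python
def in_set_decision(scores, ubm_score, threshold=0.0):
    """Two-stage in-set/out-of-set decision (illustrative sketch).

    scores: dict mapping in-set speaker id -> log p(X | model_n)
    ubm_score: log p(X | UBM), assumed here to act as the out-of-set model
    threshold: hypothetical decision threshold on the log-likelihood ratio
    """
    # Stage 1 (Eq. 1): closed-set identification, pick the most likely speaker.
    best = max(scores, key=scores.get)
    # Stage 2: verification, assumed to be a log-likelihood ratio test.
    llr = scores[best] - ubm_score
    return best, llr, llr > threshold


scores = {"spk1": -41.0, "spk2": -38.5, "spk3": -44.2}
print(in_set_decision(scores, ubm_score=-40.0))
```

Raising the UBM score (or the threshold) turns the same claim into a rejection, which is the trade-off the EER summarizes.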

2.2. GMM-UBM Baseline

The most widely recognized text-independent system uses a Gaussian
Mixture Model (GMM) to represent the out-of-set model for outliers
(i.e., the UBM) and adapts it into the in-set speaker model with
Maximum A Posteriori (MAP) adaptation [4][2]. A speaker model is
represented by $M$ Gaussian components trained from the
$D$-dimensional observation vectors $x_t$. A GMM is denoted as
$\Lambda_n = (\omega_{nm}, \mu_{nm}, \Sigma_{nm})$, for
$m = 1, \ldots, M$ and $n = 1, \ldots, N$, where $\omega_{nm}$ is the
mixture weight of the $m$th component unimodal Gaussian density
$\mathcal{N}_{nm}(x_t)$, each parameterized by a mean vector
$\mu_{nm}$ and a covariance matrix $\Sigma_{nm}$, which is assumed
diagonal:

$$\mathcal{N}_{nm}(x_t) = \frac{1}{(2\pi)^{D/2} |\Sigma_{nm}|^{1/2}} \exp\left\{-\frac{1}{2}(x_t - \mu_{nm})^{T} \Sigma_{nm}^{-1} (x_t - \mu_{nm})\right\} \qquad (2)$$
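
As a minimal sketch of Eq. (2) with a diagonal covariance, the following computes the component log-density and the GMM log-likelihood of a single frame. The function names are illustrative, and the log-sum-exp step is a standard numerical safeguard rather than something the text describes.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log of Eq. (2) when the covariance is diagonal (var holds the diagonal)."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + maha)


def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | model) = log sum_m w_m N_m(x), computed with log-sum-exp."""
    terms = [math.log(w) + log_gauss_diag(x, mu, var)
             for w, mu, var in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Toy 2-mixture, 2-dimensional model (the paper uses 32 mixtures, 19-dim MFCC).
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [2.0, 0.5]]
print(gmm_log_likelihood([0.5, -0.2], weights, means, variances))
```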

2.3. GMM-Cohort UBM Baseline

When the speaker dependent model is built with MAP using only mean
adaptation from the UBM of Sec. 2.2, the resulting GMM represents a
simple rotation of the same Gaussian mixture densities of the UBM.
The acoustic holes caused by sparse in-set data are effectively
filled with the Cohort UBM [5]. Since the cohort UBM is built with
$N_{\text{cohort}}$ ($< N_{\text{dev}}$) speaker data, the resulting
Gaussian mixture density represents a more precise acoustic space for
speaker phone information than the UBM. Here, the precise speaker
similarity measure improves the overall system [6]. The procedure is
as follows:

Step 1: Collect the mixture-tagged features (see Sec. 3.2 for GMT),
        $x^{\text{mix}^m} = \{x[p], x[q], \ldots, x[r]\}$ ($p$, $q$,
        $r$ are arbitrary vector element indices), from the GMT
        output features,
        $X_n^{\text{tagged}} = \{x_{n1}^{\text{mix}^m}, x_{n2}^{\text{mix}^m}, \ldots, x_{nT_n}^{\text{mix}^m}\}$,
        $1 \le m \le M$, for the $m$th component of GMT and
        enrollment speaker $n$, $1 \le n \le N_{\text{in-set}}$. Each
        mixture represents speaker phone-like information.

Step 2: Collect equal amounts of development features,
        $X_i^{\text{mix}} = \{x_i^{\text{mix}^m}[p], x_i^{\text{mix}^m}[q], \ldots, x_i^{\text{mix}^m}[r]\}$,
        $1 \le i \le N_{\text{dev}}$, corresponding to the in-set
        data. Both sets of speaker features should have the same
        number of mixture classes. Each mixture class for in-set and
        development should have an equal number of features,
        $x_n^{\text{mix}^m}[p] = x_i^{\text{mix}^m}[p]$.

Step 3: Build speaker models for both the in-set and the development
        features, for each in-set speaker $n$.

Step 4: Compute the $N_{\text{in-set}} \times N_{\text{dev}}$
        acoustic space distance matrix between the enrollment and
        development GMMs using the KL divergence
        $D_{KL}(\Lambda_n \,\|\, \Lambda_i)$, for
        $1 \le n \le N_{\text{in-set}}$ and
        $1 \le i \le N_{\text{dev}}$.

Step 5: Sort the KL distance scores, and pick the top
        $N_{\text{cohort}}$ speakers from the rank-ordered
        development speaker set.

Step 6: Build the cohort UBM using the top $N_{\text{cohort}}$
        speaker data.

Step 7: Adapt the speaker model from the cohort UBM with the in-set
        data.
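
Steps 4 and 5 can be sketched as below. Two simplifications are assumed here and are not taken from the text: each speaker is represented by a single diagonal Gaussian so that the KL divergence has a closed form (the KL divergence between full GMMs has none and must be approximated), and "top" is read as the smallest distance, i.e. the most similar speakers.

```python
import math


def kl_diag_gauss(m0, v0, m1, v1):
    """Closed-form KL(N0 || N1) for diagonal Gaussians (stand-in for GMMs)."""
    return 0.5 * sum(math.log(b / a) + (a + (p - q) ** 2) / b - 1.0
                     for p, a, q, b in zip(m0, v0, m1, v1))


def select_cohorts(dist_row, n_cohort):
    """Indices of the n_cohort development speakers with the smallest distance."""
    order = sorted(range(len(dist_row)), key=lambda i: dist_row[i])
    return order[:n_cohort]


# Toy data: one in-set speaker and four development speakers, 1-D models
# given as (mean, variance) pairs.
in_set = ([0.0], [1.0])
dev = [([3.0], [1.0]), ([0.5], [1.0]), ([0.1], [2.0]), ([-2.0], [1.0])]

# One row of the N_in-set x N_dev distance matrix.
row = [kl_diag_gauss(in_set[0], in_set[1], m, v) for m, v in dev]
print(select_cohorts(row, 2))
```

Repeating the row computation for every in-set speaker gives the full distance matrix; the selected development speakers' data would then be pooled to train that speaker's cohort UBM.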


3. PROPOSED ALGORITHM

3.1. Motivation

A speaker recognition system with sparse enrollment data will have a
difficult time deciding the legitimacy of a speaker’s identity given
extremely short (2 sec) test data. The acoustic space of 5 sec of
in-set speaker data is far from what is needed to represent the
entire in-set speaker acoustic space. We exploit acoustically similar
speakers’ phoneme data to fill in for sparse in-set data [5]. A
previously proposed system [6] enables us to exploit the specific
speaker phone information, and it is briefly described in Sec. 3.2.
For exceptionally short test data (2 sec), the speaker model should
not misrecognize the phones that were trained in the enrollment
stage. Robust discrimination of trained phones would impact system
performance. A test utterance longer than the in-set training data
can be used to decide which phone information is already filled and
which still needs to be filled. Since the test data is only 2~6 sec
long, it is instantaneously categorized and quantized to each mixture
of the GMM “on the fly”. We assume that each mixture of the GMM
represents speaker phone information. Consequently, emphasizing the
test speaker’s phone distribution in speaker modeling yields a better
representation of the speaker model for the given test data.

3.2. GMM Mixture Tagging (GMT)

A short amount of data requires exploiting information from
acoustically similar speakers. Additionally, separating the data
enables us to build a discriminating model for specific targets. The
phoneme is one category by which to parse the speech information. We
employ a GMM in which each mixture represents speaker phone
information. The GMM is built with both development and in-set
speaker data, so we call it the Speaker Independent GMM (S.I. GMM).
The speech feature frames are tagged with the highest-probability
mixture of the S.I. GMM. The test feature frames are also labeled
with the S.I. GMM when the claimed speaker provides his/her speech to
the system.
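
A minimal sketch of the tagging step follows, assuming that "highest-probability mixture" means the component maximizing the weighted density $\omega_m \mathcal{N}_m(x_t)$ (equivalently, the component posterior). The toy 3-mixture, 1-D model stands in for the 32-mixture S.I. GMM, and all names are illustrative.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + maha)


def tag_frames(frames, weights, means, variances):
    """Return, for each frame, the index of the highest-probability mixture."""
    tags = []
    for x in frames:
        scores = [math.log(w) + log_gauss_diag(x, mu, var)
                  for w, mu, var in zip(weights, means, variances)]
        tags.append(max(range(len(scores)), key=scores.__getitem__))
    return tags


# Toy stand-in for the S.I. GMM.
weights = [0.5, 0.3, 0.2]
means = [[-2.0], [0.0], [2.0]]
variances = [[1.0], [1.0], [1.0]]
frames = [[-1.9], [0.2], [2.5], [-3.0]]
print(tag_frames(frames, weights, means, variances))  # [0, 1, 2, 0]
```

The same function would be applied to enrollment, development, and test frames so that all three share one set of mixture labels.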

3.3. Sweet-16 On-The-Fly

The primary procedure here is similar to that presented in Sec. 2.3,
with the major difference being that a histogram of the mixture tags
of the feature frame data is used. The procedure to build the in-set
speaker model is as follows:


Fig. 1. Block diagram of Sweet-16 OTF. Each step is described in
Section 3.3.

Step 1: Select the most acoustically similar speaker set for each
        in-set speaker $n$, $1 \le n \le N_{\text{in-set}}$.

Step 2: Label the in-set and development speech feature frame data
        with 32 mixture classes using GMT.

Step 3: Count the 16 most frequently occurring mixture classes
        (top 16) and the 16 least frequently occurring mixture
        classes (bottom 16) for the claimed speaker’s features. Make
        a mixture histogram for the claimed speaker.

Step 4: Pool the top/bottom frame data of the selected cohorts and
        construct cohort GMMs $\Lambda_{\text{top-cohort}}$ and
        $\Lambda_{\text{bottom-cohort}}$ using the claimed speaker’s
        histogram.

Step 5: Using the $\Lambda_{\text{top-cohort}}$ and
        $\Lambda_{\text{bottom-cohort}}$ models for the mean,
        covariance, and mixture weights, build the in-set speaker
        models $\Lambda_n^{\text{top}}$ and
        $\Lambda_n^{\text{bottom}}$ using MAP with the corresponding
        claimed speaker’s histogram.

Step 6: Combine models $\Lambda_n^{\text{top}}$ and
        $\Lambda_n^{\text{bottom}}$ to build the final in-set speaker
        model.
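
Steps 3 and 4 can be sketched as follows. This is an illustration under stated assumptions: the tags are the per-frame mixture indices produced by GMT, ties in the histogram are broken by class index (the text does not say how ties are handled), and `pool_frames` is a hypothetical helper for collecting the cohort frames that fall in the top or bottom classes.

```python
from collections import Counter


def split_top_bottom(tags, n_mix=32, n_top=16):
    """Rank mixture classes by frame count; return (top, bottom) class lists."""
    counts = Counter(tags)
    # Descending count; ties broken by class index (an assumption).
    ranked = sorted(range(n_mix), key=lambda m: (-counts[m], m))
    return ranked[:n_top], ranked[n_top:]


def pool_frames(frames, tags, classes):
    """Keep only the frames whose mixture tag is in the given class list."""
    keep = set(classes)
    return [x for x, m in zip(frames, tags) if m in keep]


# Toy example with 4 mixture classes split 2/2 instead of 32 split 16/16.
claimed_tags = [0, 0, 0, 2, 2, 1]
top, bottom = split_top_bottom(claimed_tags, n_mix=4, n_top=2)
print(top, bottom)  # [0, 2] [1, 3]

cohort_frames = [[0.1], [0.4], [0.9], [1.3]]
cohort_tags = [0, 1, 2, 3]
print(pool_frames(cohort_frames, cohort_tags, top))  # [[0.1], [0.9]]
```

Classes that never occur in the claimed speaker's data (class 3 in the toy example) land in the bottom group, which is where the cohort data has the most to contribute.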

4. EXPERIMENTAL RESULTS

4.1. Fisher Corpus

An experiment is performed to evaluate in-set/out-of-set speaker
recognition with the telephone conversation corpus, FISHER. The 60
selected speakers comprise both in-set and out-of-set speakers. We
form three different in-set/out-of-set groupings to evaluate the
effect of group size: 15in/45out, 30in/30out, and 45in/15out. All 60
speakers are assigned to either the in-set or the out-of-set group,
with 50 randomly chosen combinations for the three different
groupings. The development set consists of 378 speakers, each having
30 sec of speech data. The analysis window size is set to 20 ms with
a 10 ms skip rate. Static 19-dimensional Mel-Frequency Cepstral
Coefficients (MFCC) are extracted and used for statistical modeling.
Silence and low-energy speech parts are removed using an energy-based
detection technique.
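
To give a sense of how sparse the data is at the frame level, the short sketch below counts the analysis windows that fit in a given duration with the 20 ms window and 10 ms skip stated above. It assumes the stated durations refer to the speech actually framed and ignores any edge padding, neither of which the text specifies.

```python
def num_frames(duration_ms, window_ms=20, skip_ms=10):
    """Number of full analysis windows that fit in duration_ms."""
    if duration_ms < window_ms:
        return 0
    return (duration_ms - window_ms) // skip_ms + 1


for seconds in (2, 5, 6, 30):
    print(seconds, "sec ->", num_frames(seconds * 1000), "frames")
```

Under these assumptions, 5 sec of enrollment data yields 499 frames, i.e. fewer than 16 frames per mixture on average for a 32-mixture model, and a 2 sec test utterance yields 199 frames.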

4.2. Evaluations

4.2.1. Baseline System

The speaker GMM consists of 32 mixtures to represent speaker traits
for the short training data. The UBM reflects the out-of-set speaker
model, or outliers, and it is built with 60 randomly selected
speakers from among the 378-speaker development set. The remaining
318 speakers are used as a potential cohort speaker pool to fill
acoustic holes for the in-set speakers; note that this 318-speaker
set does not overlap with the 60 speakers used for the UBM. For all
cohort-based evaluations, the top 5 cohort speakers are selected
based on the UBM system. With these selected cohort speakers, each
in-set speaker cohort model is built with 150 sec of data. This
cohort model is then adapted with the 5 sec in-set training data via
the MAP algorithm.
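
The adaptation step can be sketched with the widely used relevance-factor form of mean-only MAP adaptation. This is an assumption about the exact variant: the text only states that MAP is used, and the relevance factor of 16 is a hypothetical value not given in the paper.

```python
def map_adapt_means(frames, posteriors, prior_means, relevance=16.0):
    """Mean-only MAP adaptation of a GMM (relevance-factor form).

    frames: list of D-dim feature vectors
    posteriors: posteriors[t][m] = responsibility of mixture m for frame t
    prior_means: prior_means[m] = D-dim mean of the prior (e.g. cohort) model
    relevance: relevance factor (hypothetical value, not given in the text)
    """
    adapted = []
    for m, mu in enumerate(prior_means):
        n_m = sum(p[m] for p in posteriors)  # soft frame count for mixture m
        if n_m == 0.0:
            adapted.append(list(mu))  # unseen mixture: the prior mean is kept
            continue
        data_mean = [sum(p[m] * x[d] for p, x in zip(posteriors, frames)) / n_m
                     for d in range(len(mu))]
        alpha = n_m / (n_m + relevance)  # more data -> trust the data more
        adapted.append([alpha * e + (1.0 - alpha) * u
                        for e, u in zip(data_mean, mu)])
    return adapted


# Toy case: 16 frames, all assigned to mixture 0 of a 2-mixture, 1-D model.
frames = [[2.0]] * 16
posteriors = [[1.0, 0.0]] * 16
print(map_adapt_means(frames, posteriors, [[0.0], [10.0]]))  # [[1.0], [10.0]]
```

Mixture 1 receives no enrollment frames and therefore stays at its prior mean. This is the acoustic-hole situation: whatever the prior model says about unseen mixtures is what the speaker model inherits, which is why a cohort model is a better starting point than a general UBM.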

Sweet-16 was first introduced in a previous study [6], and the
present On-The-Fly (Sweet-16 OTF) training method was presented in
Sec. 3.3. The primary difference is that Sweet-16 uses the 5 sec
training data histogram to rank the mixture-tagged data, whereas
Sweet-16 OTF uses the test data histogram. Table 1 shows that
Sweet-16 OTF improves in-set speaker recognition EER by an average of
2.17% absolute over Sweet-16, and by an average of 4.03% absolute
over the GMM-UBM Baseline system, using only 2 sec of test data.


Table 1. EER (%) performance comparison using 2 sec test data.

EER (%)                    15in/45out   30in/30out   45in/15out
GMM-UBM Baseline             30.62        31.27        31.55
GMM-Cohort UBM Baseline      32.96        32.10        30.43
Sweet-16                     26.71        29.13        32.02
Sweet-16 OTF                 25.27        26.77        29.30
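
The average improvements quoted for 2 sec test data follow directly from the values in Table 1, as the short check below shows. The dictionary simply transcribes the table; `mean_gain` is an illustrative helper name.

```python
eer = {
    "GMM-UBM Baseline": [30.62, 31.27, 31.55],
    "GMM-Cohort UBM Baseline": [32.96, 32.10, 30.43],
    "Sweet-16": [26.71, 29.13, 32.02],
    "Sweet-16 OTF": [25.27, 26.77, 29.30],
}


def mean_gain(reference, proposed="Sweet-16 OTF"):
    """Average absolute EER reduction of `proposed` relative to `reference`."""
    diffs = [r - p for r, p in zip(eer[reference], eer[proposed])]
    return sum(diffs) / len(diffs)


print(round(mean_gain("Sweet-16"), 2))          # 2.17
print(round(mean_gain("GMM-UBM Baseline"), 2))  # 4.03
```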

4.2.2. Sweet-16 OTF

The proposed Sweet-16 OTF algorithm employs a cohort speaker group of
5 speakers, the same size as that used for the GMM-Cohort Baseline
system. The combination weight ratio is set to 7:3 for the top and
bottom GMM speaker models. The resulting mixture weights of the GMM
will not sum to 1 because of the blending of the two models, so this
issue needs to be addressed in future work. By employing the
mixture-tagged test data histogram, the system improves EER by an
average of 2.34% over Sweet-16 on 2 and 6 sec test data. Fig. 2 shows
that the equal error rate is reduced by 2.2%~6.49% absolute over the
GMM-UBM Baseline. Fig. 2 also indicates that a smaller in-set group
tends to produce a lower equal error rate. A larger in-set group
makes it harder to distinguish between in-set and out-of-set models,
and therefore we expect a higher EER for large in-set groups. In
summary, the proposed method impacts system performance by focusing
on the expected phone information and harvesting unseen phone
information collected from feature frame level data.
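
Since all results are reported as EER, a minimal sketch of how an EER can be estimated from trial scores may be useful. It is a generic threshold sweep, not the authors' scoring tool: in-set trials are accepted when their score is at or above the threshold, and the EER is taken at the threshold where false acceptance and false rejection rates are closest.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER by sweeping the threshold over all observed scores."""
    best_gap, eer = float("inf"), None
    for thr in sorted(set(target_scores) | set(impostor_scores)):
        # False rejection: in-set trials scored below the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        # False acceptance: out-of-set trials scored at or above the threshold.
        far = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer


# Toy scores: one in-set trial (0.3) scores below one out-of-set trial (0.5).
in_set_scores = [2.0, 1.5, 0.3, 1.1]
out_of_set_scores = [0.1, 0.5, -0.4, -1.0]
print(equal_error_rate(in_set_scores, out_of_set_scores))  # 0.25
```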

5. CONCLUSIONS AND FUTURE WORK

In this study, we have developed a novel strategy to enforce an
improved training data balance for the speaker model using the
expected phone information from a mixture-tagged histogram of the
test data, for 2 sec test data. The Sweet-16 strategy improves the
filling of acoustic holes that result from limited in-set speaker
data. Evaluations were performed on the “landline telephone channel”
data from the FISHER corpus to avoid handset variation and to focus
on acoustic hole filling. The proposed Sweet-16 OTF training method
improves in-set speaker recognition EER by 2.2~6.49% absolute with
2~6 sec of test data. Future work could consider extending the method
to normalize for handset variation effects in the FISHER corpus so
that cohort speakers can be selected from any corpus.


Fig. 2. Performance (in terms of EER (%)) of baseline and proposed
algorithms on FISHER, using in-set/out-of-set speaker sizes of 15/45,
30/30 and 45/15.